\renewcommand{\baselinestretch}{1.5}
\fontsize{12pt}{13pt}\selectfont

%\chapter[ABSTRACT]{Abstract}
\chapter*{ABSTRACT}
\markboth{ABSTRACT}{ABSTRACT}

In recent years, the introduction and development of neural networks have significantly improved the performance of speaker verification (SV) technology, which is increasingly used for identity authentication in smart homes, Internet finance, criminal investigation, and other fields. In practical application scenarios, however, system performance can degrade significantly under background noise and in complex, diverse recording environments. To improve the robustness of speaker verification systems, this thesis carries out the following research.

\begin{enumerate}
	\item \textbf{Speaker verification based on neural networks: }This thesis studies several mainstream neural network-based speaker verification models, including d-vector, x-vector, ResNet34, and ECAPA-TDNN. In addition, ablation experiments explore how speed perturbation, wave perturbation, additive noise, reverberation, and SpecAugment affect the robustness of these models. On the VoxCeleb1 dataset, ECAPA-TDNN achieved the best performance, with an EER/minDCF of 3.09\%/0.2940. Among the five data augmentation methods, additive noise gave the best single-augmentation result, with an EER/minDCF of 2.55\%/0.2739, a relative reduction of 11\%/7\% compared with the ECAPA-TDNN baseline system.

	\item \textbf{Speaker verification in complex scenarios: }The recognition accuracy of a speaker verification system drops significantly when the system is affected by complex factors such as background noise and reverberation, so robustness in complex scenarios is both a focus and a difficulty of current research. To address the problems that speaker verification systems may encounter in complex scenarios, this thesis explores methods for improving system robustness based on ECAPA-TDNN, ResNet34, and their variants, using complementary cross-entropy and contrastive loss functions, a combination of convolution and attention mechanisms, and model soup and score fusion strategies. On the Chinese dataset CN-Celeb, the best EER/minDCF is 7.83\%/0.4157, a relative reduction of 11\%/15\% compared with the baseline system ECAPA\_1024.



\end{enumerate}

\vspace{1em}
\noindent {\textbf{Key Words:}} \quad speaker verification, deep neural network, data augmentation, complex scenarios
\clearpage
\endinput